I decided to work on the topic of agriculture in India, this being I am interested in trying to see if with my model's final solution can later be enhanced to give small farmers in under developed countries like mine a chance to have a better agricultural system
But at this stage, form this project I would like to gather alot of the skills that is involved in creating an AI via the use of machine learning (supervised learning), Exploratory data analysis approach and Ethics(Societal impact).
My predictive goal is to predict how much a farmer will produce based on some input, thereby helping them have a good idea on the best periods that will yield a good amount of crops/produce.
Name: Osuntuyi Michael
Data collection:
The data was gathered from
https://data.gov.in/
Dataset description:
This dataset contains information about state and districts in india how much crop they produced, the name of the crop, in what year did they yield the inputted amount production(tonne), how much area(hectare) of land was required to acheive the amount podueced,
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import squarify
Here I input new names for Area and Production, this is because I would like to know at all times
the unit of measurement for these columns
column_names = ['State_Name', 'District_Name', 'Crop_Year', 'Season', 'Crop', 'Area(ha)', 'Production(tonne)']
Importing dataset gotten fromt eh india government dataset official website
df = pd.read_csv('crop_production.csv', names = column_names)
df
C:\Users\mikos\AppData\Local\Programs\Python\Python39\lib\site-packages\IPython\core\interactiveshell.py:3441: DtypeWarning: Columns (2,5,6) have mixed types.Specify dtype option on import or set low_memory=False. exec(code_obj, self.user_global_ns, self.user_ns)
| State_Name | District_Name | Crop_Year | Season | Crop | Area(ha) | Production(tonne) | |
|---|---|---|---|---|---|---|---|
| 0 | State_Name | District_Name | Crop_Year | Season | Crop | Area | Production |
| 1 | Andaman and Nicobar Islands | NICOBARS | 2000 | Kharif | Arecanut | 1254.00 | 2000.00 |
| 2 | Andaman and Nicobar Islands | NICOBARS | 2000 | Kharif | Other Kharif pulses | 2.00 | 1.00 |
| 3 | Andaman and Nicobar Islands | NICOBARS | 2000 | Kharif | Rice | 102.00 | 321.00 |
| 4 | Andaman and Nicobar Islands | NICOBARS | 2000 | Whole Year | Banana | 176.00 | 641.00 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 246087 | West Bengal | PURULIA | 2014 | Summer | Rice | 306.0 | 801.0 |
| 246088 | West Bengal | PURULIA | 2014 | Summer | Sesamum | 627.0 | 463.0 |
| 246089 | West Bengal | PURULIA | 2014 | Whole Year | Sugarcane | 324.0 | 16250.0 |
| 246090 | West Bengal | PURULIA | 2014 | Winter | Rice | 279151.0 | 597899.0 |
| 246091 | West Bengal | PURULIA | 2014 | Winter | Sesamum | 175.0 | 88.0 |
246092 rows × 7 columns
After importing the dataset, I noticed row index 0 contains the column names as values
therefore below I would be removing this row
df.drop(0, inplace = True)
df.head(25)
| State_Name | District_Name | Crop_Year | Season | Crop | Area(ha) | Production(tonne) | |
|---|---|---|---|---|---|---|---|
| 1 | Andaman and Nicobar Islands | NICOBARS | 2000 | Kharif | Arecanut | 1254.00 | 2000.00 |
| 2 | Andaman and Nicobar Islands | NICOBARS | 2000 | Kharif | Other Kharif pulses | 2.00 | 1.00 |
| 3 | Andaman and Nicobar Islands | NICOBARS | 2000 | Kharif | Rice | 102.00 | 321.00 |
| 4 | Andaman and Nicobar Islands | NICOBARS | 2000 | Whole Year | Banana | 176.00 | 641.00 |
| 5 | Andaman and Nicobar Islands | NICOBARS | 2000 | Whole Year | Cashewnut | 720.00 | 165.00 |
| 6 | Andaman and Nicobar Islands | NICOBARS | 2000 | Whole Year | Coconut | 18168.00 | 65100000.00 |
| 7 | Andaman and Nicobar Islands | NICOBARS | 2000 | Whole Year | Dry ginger | 36.00 | 100.00 |
| 8 | Andaman and Nicobar Islands | NICOBARS | 2000 | Whole Year | Sugarcane | 1.00 | 2.00 |
| 9 | Andaman and Nicobar Islands | NICOBARS | 2000 | Whole Year | Sweet potato | 5.00 | 15.00 |
| 10 | Andaman and Nicobar Islands | NICOBARS | 2000 | Whole Year | Tapioca | 40.00 | 169.00 |
| 11 | Andaman and Nicobar Islands | NICOBARS | 2001 | Kharif | Arecanut | 1254.00 | 2061.00 |
| 12 | Andaman and Nicobar Islands | NICOBARS | 2001 | Kharif | Other Kharif pulses | 2.00 | 1.00 |
| 13 | Andaman and Nicobar Islands | NICOBARS | 2001 | Kharif | Rice | 83.00 | 300.00 |
| 14 | Andaman and Nicobar Islands | NICOBARS | 2001 | Whole Year | Cashewnut | 719.00 | 192.00 |
| 15 | Andaman and Nicobar Islands | NICOBARS | 2001 | Whole Year | Coconut | 18190.00 | 64430000.00 |
| 16 | Andaman and Nicobar Islands | NICOBARS | 2001 | Whole Year | Dry ginger | 46.00 | 100.00 |
| 17 | Andaman and Nicobar Islands | NICOBARS | 2001 | Whole Year | Sugarcane | 1.00 | 1.00 |
| 18 | Andaman and Nicobar Islands | NICOBARS | 2001 | Whole Year | Sweet potato | 11.00 | 33.00 |
| 19 | Andaman and Nicobar Islands | NICOBARS | 2002 | Kharif | Rice | 189.20 | 510.84 |
| 20 | Andaman and Nicobar Islands | NICOBARS | 2002 | Whole Year | Arecanut | 1258.00 | 2083.00 |
| 21 | Andaman and Nicobar Islands | NICOBARS | 2002 | Whole Year | Banana | 213.00 | 1278.00 |
| 22 | Andaman and Nicobar Islands | NICOBARS | 2002 | Whole Year | Black pepper | 63.00 | 13.50 |
| 23 | Andaman and Nicobar Islands | NICOBARS | 2002 | Whole Year | Cashewnut | 719.00 | 208.00 |
| 24 | Andaman and Nicobar Islands | NICOBARS | 2002 | Whole Year | Coconut | 18240.00 | 67490000.00 |
| 25 | Andaman and Nicobar Islands | NICOBARS | 2002 | Whole Year | Dry chillies | 413.00 | 28.80 |
After dropping the rows, I noticed the index values arent in the right order anymore, therefore I decided to properly arrange the index values
df.reset_index(drop=True, inplace=True)
df
| State_Name | District_Name | Crop_Year | Season | Crop | Area(ha) | Production(tonne) | |
|---|---|---|---|---|---|---|---|
| 0 | Andaman and Nicobar Islands | NICOBARS | 2000 | Kharif | Arecanut | 1254.00 | 2000.00 |
| 1 | Andaman and Nicobar Islands | NICOBARS | 2000 | Kharif | Other Kharif pulses | 2.00 | 1.00 |
| 2 | Andaman and Nicobar Islands | NICOBARS | 2000 | Kharif | Rice | 102.00 | 321.00 |
| 3 | Andaman and Nicobar Islands | NICOBARS | 2000 | Whole Year | Banana | 176.00 | 641.00 |
| 4 | Andaman and Nicobar Islands | NICOBARS | 2000 | Whole Year | Cashewnut | 720.00 | 165.00 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 246086 | West Bengal | PURULIA | 2014 | Summer | Rice | 306.0 | 801.0 |
| 246087 | West Bengal | PURULIA | 2014 | Summer | Sesamum | 627.0 | 463.0 |
| 246088 | West Bengal | PURULIA | 2014 | Whole Year | Sugarcane | 324.0 | 16250.0 |
| 246089 | West Bengal | PURULIA | 2014 | Winter | Rice | 279151.0 | 597899.0 |
| 246090 | West Bengal | PURULIA | 2014 | Winter | Sesamum | 175.0 | 88.0 |
246091 rows × 7 columns
Trying to group the data based on seasons in the dataset
df[df['Season'] == 'Kharif']
| State_Name | District_Name | Crop_Year | Season | Crop | Area(ha) | Production(tonne) |
|---|
df[df['Season'] == 'Rabi']
| State_Name | District_Name | Crop_Year | Season | Crop | Area(ha) | Production(tonne) |
|---|
df[df['Season'] == 'Winter']
| State_Name | District_Name | Crop_Year | Season | Crop | Area(ha) | Production(tonne) |
|---|
When i try to group the data based on seasons the information is not shown
df.dtypes
State_Name object District_Name object Crop_Year object Season object Crop object Area(ha) object Production(tonne) object dtype: object
Finding(s)
The columns Area and Production are objects, but I would like these columns to be numerical values,
So mathimatical operations can be performed with/on them
Therefore, I am going to convert these columns to float datatype.
I chose float as the datatype incase of any values which are decimals/fractions
df['Area(ha)'] = df['Area(ha)'].astype(float)
df['Production(tonne)'] = df['Production(tonne)'].astype(float)
df.dtypes
State_Name object District_Name object Crop_Year object Season object Crop object Area(ha) float64 Production(tonne) float64 dtype: object
df.describe()
| Area(ha) | Production(tonne) | |
|---|---|---|
| count | 2.460910e+05 | 2.423610e+05 |
| mean | 1.200282e+04 | 5.825034e+05 |
| std | 5.052340e+04 | 1.706581e+07 |
| min | 4.000000e-02 | 0.000000e+00 |
| 25% | 8.000000e+01 | 8.800000e+01 |
| 50% | 5.820000e+02 | 7.290000e+02 |
| 75% | 4.392000e+03 | 7.023000e+03 |
| max | 8.580100e+06 | 1.250800e+09 |
EDA Next step
This enables me find out the amount of district names that are associated with each street in this dataset
districts_count_per_state = df.groupby(['State_Name'], as_index = False)['District_Name'].count()
districts_count_per_state
| State_Name | District_Name | |
|---|---|---|
| 0 | Andaman and Nicobar Islands | 203 |
| 1 | Andhra Pradesh | 9628 |
| 2 | Arunachal Pradesh | 2546 |
| 3 | Assam | 14628 |
| 4 | Bihar | 18885 |
| 5 | Chandigarh | 90 |
| 6 | Chhattisgarh | 10709 |
| 7 | Dadra and Nagar Haveli | 263 |
| 8 | Goa | 208 |
| 9 | Gujarat | 8436 |
| 10 | Haryana | 5875 |
| 11 | Himachal Pradesh | 2494 |
| 12 | Jammu and Kashmir | 1634 |
| 13 | Jharkhand | 1266 |
| 14 | Karnataka | 21122 |
| 15 | Kerala | 4261 |
| 16 | Madhya Pradesh | 22943 |
| 17 | Maharashtra | 12628 |
| 18 | Manipur | 1267 |
| 19 | Meghalaya | 2867 |
| 20 | Mizoram | 957 |
| 21 | Nagaland | 3906 |
| 22 | Odisha | 13575 |
| 23 | Puducherry | 876 |
| 24 | Punjab | 3173 |
| 25 | Rajasthan | 12514 |
| 26 | Sikkim | 714 |
| 27 | Tamil Nadu | 13547 |
| 28 | Telangana | 5649 |
| 29 | Tripura | 1412 |
| 30 | Uttar Pradesh | 33306 |
| 31 | Uttarakhand | 4896 |
| 32 | West Bengal | 9613 |
Renaming the District_name column which contains the count of districts in a state to District_Count
districts_count_per_state.rename(columns = {'District_Name':'District_Count'}, inplace = True)
districts_count_per_state
| State_Name | District_Count | |
|---|---|---|
| 0 | Andaman and Nicobar Islands | 203 |
| 1 | Andhra Pradesh | 9628 |
| 2 | Arunachal Pradesh | 2546 |
| 3 | Assam | 14628 |
| 4 | Bihar | 18885 |
| 5 | Chandigarh | 90 |
| 6 | Chhattisgarh | 10709 |
| 7 | Dadra and Nagar Haveli | 263 |
| 8 | Goa | 208 |
| 9 | Gujarat | 8436 |
| 10 | Haryana | 5875 |
| 11 | Himachal Pradesh | 2494 |
| 12 | Jammu and Kashmir | 1634 |
| 13 | Jharkhand | 1266 |
| 14 | Karnataka | 21122 |
| 15 | Kerala | 4261 |
| 16 | Madhya Pradesh | 22943 |
| 17 | Maharashtra | 12628 |
| 18 | Manipur | 1267 |
| 19 | Meghalaya | 2867 |
| 20 | Mizoram | 957 |
| 21 | Nagaland | 3906 |
| 22 | Odisha | 13575 |
| 23 | Puducherry | 876 |
| 24 | Punjab | 3173 |
| 25 | Rajasthan | 12514 |
| 26 | Sikkim | 714 |
| 27 | Tamil Nadu | 13547 |
| 28 | Telangana | 5649 |
| 29 | Tripura | 1412 |
| 30 | Uttar Pradesh | 33306 |
| 31 | Uttarakhand | 4896 |
| 32 | West Bengal | 9613 |
Sorting the districts_count_per_state dataset by District_Count in descending order to see the states with the highest districts
districts_count_per_state.sort_values('District_Count', ascending = False, inplace = True)
districts_count_per_state
| State_Name | District_Count | |
|---|---|---|
| 30 | Uttar Pradesh | 33306 |
| 16 | Madhya Pradesh | 22943 |
| 14 | Karnataka | 21122 |
| 4 | Bihar | 18885 |
| 3 | Assam | 14628 |
| 22 | Odisha | 13575 |
| 27 | Tamil Nadu | 13547 |
| 17 | Maharashtra | 12628 |
| 25 | Rajasthan | 12514 |
| 6 | Chhattisgarh | 10709 |
| 1 | Andhra Pradesh | 9628 |
| 32 | West Bengal | 9613 |
| 9 | Gujarat | 8436 |
| 10 | Haryana | 5875 |
| 28 | Telangana | 5649 |
| 31 | Uttarakhand | 4896 |
| 15 | Kerala | 4261 |
| 21 | Nagaland | 3906 |
| 24 | Punjab | 3173 |
| 19 | Meghalaya | 2867 |
| 2 | Arunachal Pradesh | 2546 |
| 11 | Himachal Pradesh | 2494 |
| 12 | Jammu and Kashmir | 1634 |
| 29 | Tripura | 1412 |
| 18 | Manipur | 1267 |
| 13 | Jharkhand | 1266 |
| 20 | Mizoram | 957 |
| 23 | Puducherry | 876 |
| 26 | Sikkim | 714 |
| 7 | Dadra and Nagar Haveli | 263 |
| 8 | Goa | 208 |
| 0 | Andaman and Nicobar Islands | 203 |
| 5 | Chandigarh | 90 |
label = districts_count_per_state['State_Name'][:20]
sizes = districts_count_per_state['District_Count']
plt.figure(figsize= (20, 10))
squarify.plot(sizes = sizes, label = label, alpha = .8)
plt.axis('off')
(0.0, 100.0, 0.0, 100.0)
The states with more districts should have higher crop production than states with less
I have noticed in this dataset there is data about how much a crop was produced for a whole year
therefore for that information to not affect my analysis, I would be removing the whole year column for this analysis but not for my entire dataset
indexNames = df[df['Season'] == 'Whole Year '].index
df_without_whole_year = df.drop(indexNames)
df_without_whole_year
| State_Name | District_Name | Crop_Year | Season | Crop | Area(ha) | Production(tonne) | |
|---|---|---|---|---|---|---|---|
| 0 | Andaman and Nicobar Islands | NICOBARS | 2000 | Kharif | Arecanut | 1254.0 | 2000.0 |
| 1 | Andaman and Nicobar Islands | NICOBARS | 2000 | Kharif | Other Kharif pulses | 2.0 | 1.0 |
| 2 | Andaman and Nicobar Islands | NICOBARS | 2000 | Kharif | Rice | 102.0 | 321.0 |
| 10 | Andaman and Nicobar Islands | NICOBARS | 2001 | Kharif | Arecanut | 1254.0 | 2061.0 |
| 11 | Andaman and Nicobar Islands | NICOBARS | 2001 | Kharif | Other Kharif pulses | 2.0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 246085 | West Bengal | PURULIA | 2014 | Summer | Maize | 325.0 | 2039.0 |
| 246086 | West Bengal | PURULIA | 2014 | Summer | Rice | 306.0 | 801.0 |
| 246087 | West Bengal | PURULIA | 2014 | Summer | Sesamum | 627.0 | 463.0 |
| 246089 | West Bengal | PURULIA | 2014 | Winter | Rice | 279151.0 | 597899.0 |
| 246090 | West Bengal | PURULIA | 2014 | Winter | Sesamum | 175.0 | 88.0 |
188786 rows × 7 columns
total_amount_crops_produced_per_state = df_without_whole_year.groupby(['State_Name'], as_index = False)['Production(tonne)'].sum()
total_amount_crops_produced_per_state
| State_Name | Production(tonne) | |
|---|---|---|
| 0 | Andaman and Nicobar Islands | 2.510869e+05 |
| 1 | Andhra Pradesh | 2.984922e+08 |
| 2 | Arunachal Pradesh | 4.298912e+06 |
| 3 | Assam | 1.150845e+08 |
| 4 | Bihar | 2.415800e+08 |
| 5 | Chandigarh | 4.827750e+04 |
| 6 | Chhattisgarh | 9.900568e+07 |
| 7 | Dadra and Nagar Haveli | 1.847871e+06 |
| 8 | Goa | 1.580998e+06 |
| 9 | Gujarat | 2.403215e+08 |
| 10 | Haryana | 2.618656e+08 |
| 11 | Himachal Pradesh | 1.685808e+07 |
| 12 | Jammu and Kashmir | 1.319469e+07 |
| 13 | Jharkhand | 1.068637e+07 |
| 14 | Karnataka | 2.384410e+08 |
| 15 | Kerala | 1.071412e+07 |
| 16 | Madhya Pradesh | 3.578921e+08 |
| 17 | Maharashtra | 6.103534e+08 |
| 18 | Manipur | 4.119722e+06 |
| 19 | Meghalaya | 7.734268e+06 |
| 20 | Mizoram | 1.254633e+06 |
| 21 | Nagaland | 1.152271e+07 |
| 22 | Odisha | 1.530753e+08 |
| 23 | Puducherry | 4.038461e+06 |
| 24 | Punjab | 4.964972e+08 |
| 25 | Rajasthan | 2.581434e+08 |
| 26 | Sikkim | 2.148563e+06 |
| 27 | Tamil Nadu | 1.347562e+08 |
| 28 | Telangana | 1.686930e+08 |
| 29 | Tripura | 1.112094e+07 |
| 30 | Uttar Pradesh | 2.373659e+09 |
| 31 | Uttarakhand | 3.590835e+07 |
| 32 | West Bengal | 5.660915e+08 |
fig = px.bar(total_amount_crops_produced_per_state, x='State_Name', y='Production(tonne)', title='Total amount yield for each state')
fig.update_layout(barmode='group')
fig.show()
My hypothesis about the the states with alot of districts should have the highest amount of crop produced is in-correct
Although according to the graph above the state Uttar pradesh has the highest yield and the squarify graph plotted above showing the how many districts each state has shows that Uttar pradesh is also the state with the most districts but some states e.g(Madhya Pradesh) has the second highest amount of districts but is the 17th highest producing state
I do this because, I want to find out the season with the most and least yield and also see the total production amount for all seasons in the dataset
amount_crops_produced_per_season = df.groupby(['Season'], as_index = False)['Production(tonne)'].sum()
amount_crops_produced_per_season
| Season | Production(tonne) | |
|---|---|---|
| 0 | Autumn | 6.441377e+07 |
| 1 | Kharif | 4.029970e+09 |
| 2 | Rabi | 2.051688e+09 |
| 3 | Summer | 1.706579e+08 |
| 4 | Whole Year | 1.344248e+11 |
| 5 | Winter | 4.345498e+08 |
After grouping the crops produced per season, I found out a season recorded is called whole year which consists of the total amount produced for the whole year,
I would not like this value when plotting therefore I am going to remove the value from the amount_crops_produced_per_season table
Before trying to remove Whole year from Season column, I decided to print out its value from amount_crops_produced_per_season table, I do this to check if the value has any whitespaces or not
amount_crops_produced_per_season.iloc[4, 0]
'Whole Year '
df.Season = df.Season.str.rstrip()
Checking if the whitespaces in the season column has been removed
df.iloc[3, 3]
'Whole Year'
amount_crops_produced_per_season = df.groupby(['Season'], as_index = False)['Production(tonne)'].sum()
amount_crops_produced_per_season
| Season | Production(tonne) | |
|---|---|---|
| 0 | Autumn | 6.441377e+07 |
| 1 | Kharif | 4.029970e+09 |
| 2 | Rabi | 2.051688e+09 |
| 3 | Summer | 1.706579e+08 |
| 4 | Whole Year | 1.344248e+11 |
| 5 | Winter | 4.345498e+08 |
I am dropping the whole year data because I would only like to see how much was produced for each season and not the whole year
#Either one of the codes below can be used in removing the specified value whole year
# amount_crops_produced_per_season.drop([0, 4], axis = 0, inplace = True)
# amount_crops_produced_per_season
indexNames = amount_crops_produced_per_season[amount_crops_produced_per_season['Season'] == 'Whole Year'].index
amount_crops_produced_per_season.drop(indexNames , inplace=True)
amount_crops_produced_per_season
| Season | Production(tonne) | |
|---|---|---|
| 0 | Autumn | 6.441377e+07 |
| 1 | Kharif | 4.029970e+09 |
| 2 | Rabi | 2.051688e+09 |
| 3 | Summer | 1.706579e+08 |
| 5 | Winter | 4.345498e+08 |
amount_crops_produced_per_season.describe()
| Production(tonne) | |
|---|---|
| count | 5.000000e+00 |
| mean | 1.350256e+09 |
| std | 1.699723e+09 |
| min | 6.441377e+07 |
| 25% | 1.706579e+08 |
| 50% | 4.345498e+08 |
| 75% | 2.051688e+09 |
| max | 4.029970e+09 |
def format_number(data_value, indx):
if data_value >= 1000000:
formatter = '{:1.1f}M'.format(data_value*0.000_001)
else:
formatter = '{:1.0f}K'.format(data_value*0.001)
return formatter
fig, ax = plt.subplots(figsize=(15,6))
ax.barh(amount_crops_produced_per_season['Season'], amount_crops_produced_per_season['Production(tonne)'])
ax.set_xlabel('Total amount of Crops produced(tonne)')
ax.set_ylabel('Crop production seasons')
ax.set_title('Total amount of crops produced per season')
ax.xaxis.set_major_formatter(format_number)
plt.show()
I have decided to remove the data regarding whole year from the main dataset,
this is because I do not need this data for my overall analysis and also predictive model
indexNames = df[df['Season'] == 'Whole Year'].index
df.drop(indexNames , inplace=True)
df
| State_Name | District_Name | Crop_Year | Season | Crop | Area(ha) | Production(tonne) | |
|---|---|---|---|---|---|---|---|
| 0 | Andaman and Nicobar Islands | NICOBARS | 2000 | Kharif | Arecanut | 1254.0 | 2000.0 |
| 1 | Andaman and Nicobar Islands | NICOBARS | 2000 | Kharif | Other Kharif pulses | 2.0 | 1.0 |
| 2 | Andaman and Nicobar Islands | NICOBARS | 2000 | Kharif | Rice | 102.0 | 321.0 |
| 10 | Andaman and Nicobar Islands | NICOBARS | 2001 | Kharif | Arecanut | 1254.0 | 2061.0 |
| 11 | Andaman and Nicobar Islands | NICOBARS | 2001 | Kharif | Other Kharif pulses | 2.0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 246085 | West Bengal | PURULIA | 2014 | Summer | Maize | 325.0 | 2039.0 |
| 246086 | West Bengal | PURULIA | 2014 | Summer | Rice | 306.0 | 801.0 |
| 246087 | West Bengal | PURULIA | 2014 | Summer | Sesamum | 627.0 | 463.0 |
| 246089 | West Bengal | PURULIA | 2014 | Winter | Rice | 279151.0 | 597899.0 |
| 246090 | West Bengal | PURULIA | 2014 | Winter | Sesamum | 175.0 | 88.0 |
188786 rows × 7 columns
I do this because, I want to find out the year with the most and least yield and also see the totla production amount of all years in the dataset
crop_year_per_prod = df.groupby(['Crop_Year'], as_index = False)['Production(tonne)'].sum()
crop_year_per_prod
| Crop_Year | Production(tonne) | |
|---|---|---|
| 0 | 1997 | 1.965446e+08 |
| 1 | 1998 | 1.674131e+08 |
| 2 | 1999 | 1.528994e+08 |
| 3 | 2000 | 1.407050e+08 |
| 4 | 2001 | 1.492294e+08 |
| 5 | 2002 | 1.306218e+08 |
| 6 | 2003 | 1.519163e+08 |
| 7 | 2004 | 2.745188e+08 |
| 8 | 2005 | 3.020638e+08 |
| 9 | 2006 | 2.992419e+08 |
| 10 | 2007 | 2.848714e+08 |
| 11 | 2008 | 3.210495e+08 |
| 12 | 2009 | 3.371434e+08 |
| 13 | 2010 | 3.805623e+08 |
| 14 | 2011 | 3.066722e+08 |
| 15 | 2012 | 3.068063e+08 |
| 16 | 2013 | 3.330503e+08 |
| 17 | 2014 | 3.168134e+08 |
| 18 | 2015 | 6.935065e+06 |
| 19 | 1997 | 9.374314e+07 |
| 20 | 1998 | 1.222536e+08 |
| 21 | 1999 | 9.294531e+07 |
| 22 | 2000 | 9.348531e+07 |
| 23 | 2001 | 1.025934e+08 |
| 24 | 2002 | 8.631028e+07 |
| 25 | 2003 | 1.121630e+08 |
| 26 | 2004 | 1.053118e+08 |
| 27 | 2005 | 1.284176e+08 |
| 28 | 2006 | 1.173023e+08 |
| 29 | 2007 | 1.207511e+08 |
| 30 | 2008 | 1.475978e+08 |
| 31 | 2009 | 1.197610e+08 |
| 32 | 2010 | 1.788497e+08 |
| 33 | 2011 | 1.622513e+08 |
| 34 | 2012 | 1.683872e+08 |
| 35 | 2013 | 1.418905e+08 |
| 36 | 2014 | 9.820722e+07 |
After trying to group the production amount by each year,
I found out the crop year does not get grouped properly
crop_year_per_prod.head(10).dtypes
Crop_Year object Production(tonne) float64 dtype: object
crop_year_per_prod.tail(10).dtypes
Crop_Year object Production(tonne) float64 dtype: object
After checking, I know all the values in the crop_year column are of the datatype object.
crop_year_per_prod['Crop_Year'].unique()
array([1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007,
2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, '1997', '1998',
'1999', '2000', '2001', '2002', '2003', '2004', '2005', '2006',
'2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014'],
dtype=object)
I found out that each value in the column is seen as unique
By checking if all the values are unique, I found out ater looking closely at the output the values are not of the same datatype.
This is due to the values which the unique functions returned some have quotation marks which signifies stirngs and some dont which signifies numerical values.
Therefore I have to convert the entire column to strings.
df['Crop_Year'] = df['Crop_Year'].astype(int)
Now checking if the values in Crop_Year are unique
df['Crop_Year'].unique()
array([2000, 2001, 2002, 2003, 2004, 2005, 2010, 2006, 1997, 1998, 1999,
2007, 2008, 2009, 2011, 2012, 2013, 2014, 2015])
After the conversion to string all values are unique.
amount_crops_produced_per_year = df.groupby(['Crop_Year'], as_index = False)['Production(tonne)'].sum()
amount_crops_produced_per_year
| Crop_Year | Production(tonne) | |
|---|---|---|
| 0 | 1997 | 2.902877e+08 |
| 1 | 1998 | 2.896667e+08 |
| 2 | 1999 | 2.458447e+08 |
| 3 | 2000 | 2.341903e+08 |
| 4 | 2001 | 2.518229e+08 |
| 5 | 2002 | 2.169321e+08 |
| 6 | 2003 | 2.640792e+08 |
| 7 | 2004 | 3.798306e+08 |
| 8 | 2005 | 4.304814e+08 |
| 9 | 2006 | 4.165442e+08 |
| 10 | 2007 | 4.056225e+08 |
| 11 | 2008 | 4.686473e+08 |
| 12 | 2009 | 4.569043e+08 |
| 13 | 2010 | 5.594120e+08 |
| 14 | 2011 | 4.689235e+08 |
| 15 | 2012 | 4.751936e+08 |
| 16 | 2013 | 4.749408e+08 |
| 17 | 2014 | 4.150207e+08 |
| 18 | 2015 | 6.935065e+06 |
Image of Crops produced per year before whole year data removal
From the graph, I have found that there was a huge increase in the amount of crops produced in the year 2010 and 2014, but a huge decline in 2015, therefore I would have to find out why this is so.
I plan on adding another dataset contaning weather conditions for each season per year, this may help me find out why there was a huge increase 2011 and 2013 then a huge decrease in 2015
After more EDA was done on this dataset, I realised it was the production values from whole year which influenced the spike in the years 2011 and 2014
Once the production values for the whole year season are removed we have a more gradual peak and not a huge spike in production
fig, ax = plt.subplots(figsize=(15,6))
ax.plot(amount_crops_produced_per_year['Crop_Year'], amount_crops_produced_per_year['Production(tonne)'], marker = "o",)
ax.set_ylabel('Total amount of Crops produced(tonne)')
ax.set_xlabel('Crop production year')
ax.set_title('Total amount of crops produced per year')
ax.yaxis.set_major_formatter(format_number)
plt.show()
crops_total_amount_produced_each_year = df.groupby(['Crop_Year', 'Crop'], as_index = False)['Production(tonne)'].sum()
crops_total_amount_produced_each_year
| Crop_Year | Crop | Production(tonne) | |
|---|---|---|---|
| 0 | 1997 | Arhar/Tur | 1655931.0 |
| 1 | 1997 | Bajra | 7410795.0 |
| 2 | 1997 | Banana | 2762324.0 |
| 3 | 1997 | Barley | 1581811.0 |
| 4 | 1997 | Black pepper | 924.0 |
| ... | ... | ... | ... |
| 958 | 2015 | Small millets | 2911.0 |
| 959 | 2015 | Soyabean | 3190.0 |
| 960 | 2015 | Sugarcane | 577157.2 |
| 961 | 2015 | Urad | 27063.1 |
| 962 | 2015 | Wheat | 840.4 |
963 rows × 3 columns
The crop which is mostly farmed should have the highest production e.g(rice is being produced by 5 countries and wheat just 2 rice should have more amounts produced)
all_years = crops_total_amount_produced_each_year['Crop_Year'].unique()
for year in all_years:
fig, ax = plt.subplots(figsize=(20,6))
index = crops_total_amount_produced_each_year[crops_total_amount_produced_each_year['Crop_Year'] == year]
test = index[index['Production(tonne)'] <= 99999]
ax.bar(test['Crop'], test['Production(tonne)'])
ax.yaxis.set_major_formatter(format_number)
ax.set_title(year)
plt.show()
Due to matplotlib not displaying the crop names on the x axis properly,
I decided to switch to using plotly to visualize the data above
for year in all_years:
index = crops_total_amount_produced_each_year[crops_total_amount_produced_each_year['Crop_Year'] == year]
fig = px.bar(index, x='Crop', y='Production(tonne)', title=f'Crop year {year}')
fig.update_layout(barmode='group')
fig.show()
for year in all_years:
index = crops_total_amount_produced_each_year[crops_total_amount_produced_each_year['Crop_Year'] == year]
test = index[index['Production(tonne)'] <= 99999]
fig = px.bar(test, x='Crop', y='Production(tonne)', title=f'Crop year {year}')
fig.update_layout(barmode='group')
fig.show()
crop_count_df = df.groupby(['Crop'])['Crop'].count()
# converting grouped series to a dataframe
crop_count_df = crop_count_df.to_frame()
# renaming crop count column form crop to crop_count
crop_count_df.rename(columns = {'Crop' : 'Crop_Count'}, inplace = True)
# resetting the index of the dataframe to move the index from crop
crop_count_df = crop_count_df.reset_index()
crop_count_df
| Crop | Crop_Count | |
|---|---|---|
| 0 | Arecanut | 134 |
| 1 | Arhar/Tur | 7322 |
| 2 | Bajra | 5267 |
| 3 | Banana | 394 |
| 4 | Barley | 4151 |
| ... | ... | ... |
| 80 | Urad | 9677 |
| 81 | Varagu | 58 |
| 82 | Wheat | 7844 |
| 83 | other misc. pulses | 70 |
| 84 | other oilseeds | 551 |
85 rows × 2 columns
ordered_crop_count = crop_count_df.sort_values(by = 'Crop_Count', ascending = False)
ordered_crop_count = ordered_crop_count.reset_index()
ordered_crop_count
| index | Crop | Crop_Count | |
|---|---|---|---|
| 0 | 62 | Rice | 14976 |
| 1 | 38 | Maize | 13707 |
| 2 | 42 | Moong(Green Gram) | 10109 |
| 3 | 80 | Urad | 9677 |
| 4 | 25 | Groundnut | 8558 |
| ... | ... | ... | ... |
| 80 | 63 | Ricebean (nagadal) | 10 |
| 81 | 54 | Perilla | 9 |
| 82 | 28 | Jobster | 9 |
| 83 | 75 | Tea | 8 |
| 84 | 32 | Kapas | 8 |
85 rows × 3 columns
Crop_count this contains how many times a certain crop appears in the database
plt.figure(figsize=(18,12))
sns.barplot("Crop","Crop_Count",data=crop_count_df).set(title='Crop count')
plt.xticks(rotation=90)
plt.show()
C:\Users\mikos\AppData\Local\Programs\Python\Python39\lib\site-packages\seaborn\_decorators.py:36: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
top_25_crops = ordered_crop_count.head(25)
top_25_crops
| index | Crop | Crop_Count | |
|---|---|---|---|
| 0 | 62 | Rice | 14976 |
| 1 | 38 | Maize | 13707 |
| 2 | 42 | Moong(Green Gram) | 10109 |
| 3 | 80 | Urad | 9677 |
| 4 | 25 | Groundnut | 8558 |
| 5 | 68 | Sesamum | 8334 |
| 6 | 82 | Wheat | 7844 |
| 7 | 61 | Rapeseed &Mustard | 7472 |
| 8 | 1 | Arhar/Tur | 7322 |
| 9 | 23 | Gram | 7188 |
| 10 | 29 | Jowar | 6879 |
| 11 | 2 | Bajra | 5267 |
| 12 | 72 | Sunflower | 5237 |
| 13 | 69 | Small millets | 4484 |
| 14 | 53 | Peas & beans (Pulses) | 4389 |
| 15 | 37 | Linseed | 4342 |
| 16 | 40 | Masoor | 4176 |
| 17 | 17 | Cotton(lint) | 4156 |
| 18 | 4 | Barley | 4151 |
| 19 | 59 | Ragi | 3965 |
| 20 | 27 | Horse-gram | 3766 |
| 21 | 50 | Other Kharif pulses | 3659 |
| 22 | 48 | Other Rabi pulses | 3180 |
| 23 | 12 | Castor seed | 3166 |
| 24 | 70 | Soyabean | 3151 |
For a better view I have filtered out the crop count to only include the top 25 popular crops or the 25 least popular
plt.figure(figsize=(18,12))
sns.barplot("Crop","Crop_Count",data=top_25_crops).set(title = 'The 25 most popular crops')
plt.xticks(rotation=90)
plt.show()
C:\Users\mikos\AppData\Local\Programs\Python\Python39\lib\site-packages\seaborn\_decorators.py:36: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
label = ordered_crop_count['Crop'][:25]
sizes = ordered_crop_count['Crop_Count']
plt.figure(figsize= (18, 10))
squarify.plot(sizes = sizes, label = label, alpha = .8)
plt.axis('off')
(0.0, 100.0, 0.0, 100.0)
From the visualization in cell 35 (for each year show the total amount of crops that were produced and the name of the crop), I found that over the years Coconut has the highest amount produced,
but after visualizing the most common crop produced,
I noticed that Coconut which is the crop with the highest production is not a common crop produced
(it is not among the top 25 crops produced)
After removing whole year season data I found out over the years coconut is no longer the crop with the highest production anymore and crops which are among the 25 most popular are the highest produced e.g Rice, Maize, Wheat
least_25_crops = ordered_crop_count.tail(25)
least_25_crops
| index | Crop | Crop_Count | |
|---|---|---|---|
| 60 | 45 | Oilseeds total | 90 |
| 61 | 65 | Samai | 87 |
| 62 | 77 | Tomato | 78 |
| 63 | 83 | other misc. pulses | 70 |
| 64 | 9 | Cabbage | 61 |
| 65 | 11 | Cashewnut | 60 |
| 66 | 81 | Varagu | 58 |
| 67 | 39 | Mango | 55 |
| 68 | 67 | Sapota | 39 |
| 69 | 35 | Lemon | 39 |
| 70 | 36 | Lentil | 31 |
| 71 | 56 | Pome Granet | 21 |
| 72 | 5 | Bean | 20 |
| 73 | 15 | Cond-spcs other | 18 |
| 74 | 60 | Rajmash Kholar | 18 |
| 75 | 13 | Coconut | 17 |
| 76 | 10 | Cardamom | 17 |
| 77 | 31 | Jute & mesta | 12 |
| 78 | 24 | Grapes | 12 |
| 79 | 14 | Colocosia | 11 |
| 80 | 63 | Ricebean (nagadal) | 10 |
| 81 | 54 | Perilla | 9 |
| 82 | 28 | Jobster | 9 |
| 83 | 75 | Tea | 8 |
| 84 | 32 | Kapas | 8 |
plt.figure(figsize=(18,12))
sns.barplot('Crop', 'Crop_Count', data = least_25_crops).set(title='The 25 least popular crops')
plt.xticks(rotation=90)
plt.show()
C:\Users\mikos\AppData\Local\Programs\Python\Python39\lib\site-packages\seaborn\_decorators.py:36: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
The top 5 crops in this dataset are Rice, Maize, Moongl(Green gram), Urad, Seasamum
The least 5 crops in this dataset are other dry fruit, Apple, Peach, Plums, Pear
top_25_crops = ordered_crop_count['Crop'].head(25)
all_crops = top_25_crops.unique()
all_crops
array(['Rice', 'Maize', 'Moong(Green Gram)', 'Urad', 'Groundnut',
'Sesamum', 'Wheat', 'Rapeseed &Mustard', 'Arhar/Tur', 'Gram',
'Jowar', 'Bajra', 'Sunflower', 'Small millets',
'Peas & beans (Pulses)', 'Linseed', 'Masoor', 'Cotton(lint)',
'Barley', 'Ragi', 'Horse-gram', 'Other Kharif pulses',
'Other Rabi pulses', 'Castor seed', 'Soyabean'], dtype=object)
for crop in all_crops:
crop_df = df[df['Crop'] == crop]
crop_df = crop_df.groupby(['Season'], as_index = False)['Production(tonne)'].sum()
fig, ax = plt.subplots(figsize=(15,5))
sns.barplot('Season', 'Production(tonne)', data = crop_df).set(title= crop)
ax.yaxis.set_major_formatter(format_number)
plt.show()
C:\Users\mikos\AppData\Local\Programs\Python\Python39\lib\site-packages\seaborn\_decorators.py:36: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
C:\Users\mikos\AppData\Local\Programs\Python\Python39\lib\site-packages\seaborn\_decorators.py:36: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
C:\Users\mikos\AppData\Local\Programs\Python\Python39\lib\site-packages\seaborn\_decorators.py:36: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
C:\Users\mikos\AppData\Local\Programs\Python\Python39\lib\site-packages\seaborn\_decorators.py:36: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
C:\Users\mikos\AppData\Local\Programs\Python\Python39\lib\site-packages\seaborn\_decorators.py:36: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
C:\Users\mikos\AppData\Local\Programs\Python\Python39\lib\site-packages\seaborn\_decorators.py:36: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
C:\Users\mikos\AppData\Local\Programs\Python\Python39\lib\site-packages\seaborn\_decorators.py:36: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
C:\Users\mikos\AppData\Local\Programs\Python\Python39\lib\site-packages\seaborn\_decorators.py:36: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
C:\Users\mikos\AppData\Local\Programs\Python\Python39\lib\site-packages\seaborn\_decorators.py:36: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
C:\Users\mikos\AppData\Local\Programs\Python\Python39\lib\site-packages\seaborn\_decorators.py:36: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
C:\Users\mikos\AppData\Local\Programs\Python\Python39\lib\site-packages\seaborn\_decorators.py:36: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
C:\Users\mikos\AppData\Local\Programs\Python\Python39\lib\site-packages\seaborn\_decorators.py:36: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
C:\Users\mikos\AppData\Local\Programs\Python\Python39\lib\site-packages\seaborn\_decorators.py:36: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
C:\Users\mikos\AppData\Local\Programs\Python\Python39\lib\site-packages\seaborn\_decorators.py:36: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
C:\Users\mikos\AppData\Local\Programs\Python\Python39\lib\site-packages\seaborn\_decorators.py:36: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
C:\Users\mikos\AppData\Local\Programs\Python\Python39\lib\site-packages\seaborn\_decorators.py:36: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
C:\Users\mikos\AppData\Local\Programs\Python\Python39\lib\site-packages\seaborn\_decorators.py:36: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
C:\Users\mikos\AppData\Local\Programs\Python\Python39\lib\site-packages\seaborn\_decorators.py:36: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
C:\Users\mikos\AppData\Local\Programs\Python\Python39\lib\site-packages\seaborn\_decorators.py:36: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
C:\Users\mikos\AppData\Local\Programs\Python\Python39\lib\site-packages\seaborn\_decorators.py:36: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
C:\Users\mikos\AppData\Local\Programs\Python\Python39\lib\site-packages\seaborn\_decorators.py:36: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
C:\Users\mikos\AppData\Local\Programs\Python\Python39\lib\site-packages\seaborn\_decorators.py:36: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
C:\Users\mikos\AppData\Local\Programs\Python\Python39\lib\site-packages\seaborn\_decorators.py:36: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
C:\Users\mikos\AppData\Local\Programs\Python\Python39\lib\site-packages\seaborn\_decorators.py:36: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
C:\Users\mikos\AppData\Local\Programs\Python\Python39\lib\site-packages\seaborn\_decorators.py:36: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
I decided to use plotly here because of its zooming in functionality, therefore users who have access to my notebook can zoom in easily to data which is too small in the graphical representation
for crop in all_crops:
crop_df = df[df['Crop'] == crop]
crop_df = crop_df.groupby(['Season'], as_index = False)['Production(tonne)'].sum()
fig = px.bar(crop_df, x='Season', y='Production(tonne)', title=f'Crop {crop}')
fig.update_layout(barmode='group')
fig.show()
From the graph plotted Kharif season has the highest production amount per season, I have decided to go further and research why?
Result: While trying to find out why kharif season has the highest production, I discovered that Kharif and Rabi which are the strange seasons for me in this dataset are not actually weather seasons but they are better known as crop seasons
i.e Kharif is a crop season for crops which grow during Monsoon/Autumn (July to October) Rabi is a crop season for crops which grow during Winter (October to March) Zaid the third crop season in india is a crop season for crops which grow in Summer (March to June)
Note:
Due to Monsoon in India falling into Autumn I have classified both as the same
Mapping each data value in the season column to its corresponding weather season
weather_seasons = {'Autumn': 'Autumn', 'Kharif':'Autumn', 'Rabi': 'Winter', 'Summer': 'Summer', 'Winter': 'Winter'}
df['Weather_Seasons'] = df['Season'].apply(lambda x: weather_seasons[x])
Here I want to map each crop with its corresponding crop type based on India agriculture(Kharif, Zaid, Rabi)
Step1:
Counting how many times a crop is planted during each weather season, I do this because I want to find the most popular season each crop is planted
By getting this I would be able to tell its crop type
crop_count_df = df.groupby(['Crop', 'Weather_Seasons'])['Crop'].count()
crop_count_df = crop_count_df.to_frame()
crop_count_df
| Crop | ||
|---|---|---|
| Crop | Weather_Seasons | |
| Arecanut | Autumn | 16 |
| Winter | 118 | |
| Arhar/Tur | Autumn | 6816 |
| Summer | 28 | |
| Winter | 478 | |
| ... | ... | ... |
| Wheat | Winter | 7537 |
| other misc. pulses | Autumn | 29 |
| Winter | 41 | |
| other oilseeds | Autumn | 248 |
| Winter | 303 |
181 rows × 1 columns
Renaming the Crop column after count to Crop_Count
crop_count_df.rename(columns = {'Crop':'Crop_Count'}, inplace = True)
Resetting the index of the newly created dataframe, this is to change Crop and Weather_Seasons column from the index
crop_count_df = crop_count_df.reset_index()
crop_count_df
| Crop | Weather_Seasons | Crop_Count | |
|---|---|---|---|
| 0 | Arecanut | Autumn | 16 |
| 1 | Arecanut | Winter | 118 |
| 2 | Arhar/Tur | Autumn | 6816 |
| 3 | Arhar/Tur | Summer | 28 |
| 4 | Arhar/Tur | Winter | 478 |
| ... | ... | ... | ... |
| 176 | Wheat | Winter | 7537 |
| 177 | other misc. pulses | Autumn | 29 |
| 178 | other misc. pulses | Winter | 41 |
| 179 | other oilseeds | Autumn | 248 |
| 180 | other oilseeds | Winter | 303 |
181 rows × 3 columns
Getting the highest crop count for each crop
crop_max_df = crop_count_df.groupby(['Crop'])['Crop_Count'].max()
Converting result to a dataframe
crop_max_df = crop_max_df.to_frame()
crop_max_df
| Crop_Count | |
|---|---|
| Crop | |
| Arecanut | 118 |
| Arhar/Tur | 6816 |
| Bajra | 4672 |
| Banana | 190 |
| Barley | 4017 |
| ... | ... |
| Urad | 6238 |
| Varagu | 45 |
| Wheat | 7537 |
| other misc. pulses | 41 |
| other oilseeds | 303 |
85 rows × 1 columns
Resetting the index of the newly created dataframe, this is to change Crop column from the index
crop_max_df = crop_max_df.reset_index()
crop_max_df
| Crop | Crop_Count | |
|---|---|---|
| 0 | Arecanut | 118 |
| 1 | Arhar/Tur | 6816 |
| 2 | Bajra | 4672 |
| 3 | Banana | 190 |
| 4 | Barley | 4017 |
| ... | ... | ... |
| 80 | Urad | 6238 |
| 81 | Varagu | 45 |
| 82 | Wheat | 7537 |
| 83 | other misc. pulses | 41 |
| 84 | other oilseeds | 303 |
85 rows × 2 columns
Merging both dataframe to give only data entires that exist in both dataframe
crop_count_merged = pd.merge(crop_count_df, crop_max_df)
crop_count_merged
| Crop | Weather_Seasons | Crop_Count | |
|---|---|---|---|
| 0 | Arecanut | Winter | 118 |
| 1 | Arhar/Tur | Autumn | 6816 |
| 2 | Bajra | Autumn | 4672 |
| 3 | Banana | Summer | 190 |
| 4 | Barley | Winter | 4017 |
| ... | ... | ... | ... |
| 83 | Urad | Autumn | 6238 |
| 84 | Varagu | Autumn | 45 |
| 85 | Wheat | Winter | 7537 |
| 86 | other misc. pulses | Winter | 41 |
| 87 | other oilseeds | Winter | 303 |
88 rows × 3 columns
Mapping each crop type to the weather season it is associated with
crop_type = {'Autumn': 'Kharif', 'Summer': 'Zaid', 'Winter': 'Rabi'}
crop_count_merged['Crop_Type'] = crop_count_merged['Weather_Seasons'].apply(lambda x: crop_type[x])
crop_count_merged
| Crop | Weather_Seasons | Crop_Count | Crop_Type | |
|---|---|---|---|---|
| 0 | Arecanut | Winter | 118 | Rabi |
| 1 | Arhar/Tur | Autumn | 6816 | Kharif |
| 2 | Bajra | Autumn | 4672 | Kharif |
| 3 | Banana | Summer | 190 | Zaid |
| 4 | Barley | Winter | 4017 | Rabi |
| ... | ... | ... | ... | ... |
| 83 | Urad | Autumn | 6238 | Kharif |
| 84 | Varagu | Autumn | 45 | Kharif |
| 85 | Wheat | Winter | 7537 | Rabi |
| 86 | other misc. pulses | Winter | 41 | Rabi |
| 87 | other oilseeds | Winter | 303 | Rabi |
88 rows × 4 columns
Dropping weather_seasons in the merged dataframe this is because I do not need this column anymore
crop_count_merged.drop(['Weather_Seasons'], axis = 1, inplace = True)
Merging my main dataframe df and my crop dataframe which consists of the crop type.
I merge with left because this will keep the data entries of both dataframes
df = pd.merge(df, crop_count_merged, how = 'left')
df
| State_Name | District_Name | Crop_Year | Season | Crop | Area(ha) | Production(tonne) | Weather_Seasons | Crop_Count | Crop_Type | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Andaman and Nicobar Islands | NICOBARS | 2000 | Kharif | Arecanut | 1254.0 | 2000.0 | Autumn | 118 | Rabi |
| 1 | Andaman and Nicobar Islands | NICOBARS | 2000 | Kharif | Other Kharif pulses | 2.0 | 1.0 | Autumn | 3651 | Kharif |
| 2 | Andaman and Nicobar Islands | NICOBARS | 2000 | Kharif | Rice | 102.0 | 321.0 | Autumn | 8969 | Kharif |
| 3 | Andaman and Nicobar Islands | NICOBARS | 2001 | Kharif | Arecanut | 1254.0 | 2061.0 | Autumn | 118 | Rabi |
| 4 | Andaman and Nicobar Islands | NICOBARS | 2001 | Kharif | Other Kharif pulses | 2.0 | 1.0 | Autumn | 3651 | Kharif |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 188968 | West Bengal | PURULIA | 2014 | Summer | Maize | 325.0 | 2039.0 | Summer | 8254 | Kharif |
| 188969 | West Bengal | PURULIA | 2014 | Summer | Rice | 306.0 | 801.0 | Summer | 8969 | Kharif |
| 188970 | West Bengal | PURULIA | 2014 | Summer | Sesamum | 627.0 | 463.0 | Summer | 6805 | Kharif |
| 188971 | West Bengal | PURULIA | 2014 | Winter | Rice | 279151.0 | 597899.0 | Winter | 8969 | Kharif |
| 188972 | West Bengal | PURULIA | 2014 | Winter | Sesamum | 175.0 | 88.0 | Winter | 6805 | Kharif |
188973 rows × 10 columns
After creating all the required columns from Season, I have decided I will delete this column because it will be of no use anymore
df.drop(['Season'], axis = 'columns', inplace = True)
df.drop(['Crop_Count'], axis = 'columns', inplace = True)
df
| State_Name | District_Name | Crop_Year | Crop | Area(ha) | Production(tonne) | Weather_Seasons | Crop_Type | |
|---|---|---|---|---|---|---|---|---|
| 0 | Andaman and Nicobar Islands | NICOBARS | 2000 | Arecanut | 1254.0 | 2000.0 | Autumn | Rabi |
| 1 | Andaman and Nicobar Islands | NICOBARS | 2000 | Other Kharif pulses | 2.0 | 1.0 | Autumn | Kharif |
| 2 | Andaman and Nicobar Islands | NICOBARS | 2000 | Rice | 102.0 | 321.0 | Autumn | Kharif |
| 3 | Andaman and Nicobar Islands | NICOBARS | 2001 | Arecanut | 1254.0 | 2061.0 | Autumn | Rabi |
| 4 | Andaman and Nicobar Islands | NICOBARS | 2001 | Other Kharif pulses | 2.0 | 1.0 | Autumn | Kharif |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 188968 | West Bengal | PURULIA | 2014 | Maize | 325.0 | 2039.0 | Summer | Kharif |
| 188969 | West Bengal | PURULIA | 2014 | Rice | 306.0 | 801.0 | Summer | Kharif |
| 188970 | West Bengal | PURULIA | 2014 | Sesamum | 627.0 | 463.0 | Summer | Kharif |
| 188971 | West Bengal | PURULIA | 2014 | Rice | 279151.0 | 597899.0 | Winter | Kharif |
| 188972 | West Bengal | PURULIA | 2014 | Sesamum | 175.0 | 88.0 | Winter | Kharif |
188973 rows × 8 columns
This graph shows when is the best yield season for the top 25 crops
crop_prod_per_season_df = pd.DataFrame()
for crop in all_crops:
crop_df = df[df['Crop'] == crop]
crop_df = crop_df.groupby(['Weather_Seasons', 'Crop'], as_index = False)['Production(tonne)'].sum()
crop_prod_per_season_df = crop_prod_per_season_df.append(crop_df)
fig = px.bar(crop_df, x='Weather_Seasons', y='Production(tonne)', title =f'{crop}')
fig.update_layout(barmode='group')
fig.show()
From the graph above we can see a more logical representation which states that crops which are mostly popular have a high yield that the least.
E.g
Rice is one of the crops that has a lot of farming done on and whene we look at the graph we can see rice has the highest production and this is true for all other crops analysed
To give a better graphcal representation the graphs below are side by side comaprisons
fig = px.bar(crop_prod_per_season_df, x='Weather_Seasons', y='Production(tonne)', color='Crop')
fig.update_layout(barmode = 'group')
fig.show()
crop_prod_inmill_per_season_df = crop_prod_per_season_df[crop_prod_per_season_df['Production(tonne)'] <= 9999999]
fig = px.bar(crop_prod_inmill_per_season_df, x='Weather_Seasons', y='Production(tonne)', color='Crop')
fig.update_layout(barmode = 'group')
fig.show()
crop_prod_inth_per_season_df = crop_prod_per_season_df[crop_prod_per_season_df['Production(tonne)'] <= 999999]
fig = px.bar(crop_prod_inth_per_season_df, x='Weather_Seasons', y='Production(tonne)', color='Crop')
fig.update_layout(barmode = 'group')
fig.show()
I would like to see for each state the crop which had alot of yield overall, therefore giving me information about the crop states in this dataset produce alot
total_amount_of_crops_per_state = df.groupby(['State_Name', 'Crop'], as_index = False)['Production(tonne)'].sum()
total_amount_of_crops_per_state
| State_Name | Crop | Production(tonne) | |
|---|---|---|---|
| 0 | Andaman and Nicobar Islands | Arecanut | 20300.0 |
| 1 | Andaman and Nicobar Islands | Arhar/Tur | 104.0 |
| 2 | Andaman and Nicobar Islands | Black pepper | 120.0 |
| 3 | Andaman and Nicobar Islands | Cashewnut | 310.0 |
| 4 | Andaman and Nicobar Islands | Dry chillies | 575.0 |
| ... | ... | ... | ... |
| 919 | West Bengal | Small millets | 42033.0 |
| 920 | West Bengal | Soyabean | 7009.0 |
| 921 | West Bengal | Sunflower | 141789.0 |
| 922 | West Bengal | Urad | 756330.0 |
| 923 | West Bengal | Wheat | 15785272.0 |
924 rows × 3 columns
Getting each individual state in the dataset
all_states = total_amount_of_crops_per_state['State_Name'].unique()
all_states
array(['Andaman and Nicobar Islands', 'Andhra Pradesh',
'Arunachal Pradesh', 'Assam', 'Bihar', 'Chandigarh',
'Chhattisgarh', 'Dadra and Nagar Haveli', 'Goa', 'Gujarat',
'Haryana', 'Himachal Pradesh', 'Jammu and Kashmir ', 'Jharkhand',
'Karnataka', 'Kerala', 'Madhya Pradesh', 'Maharashtra', 'Manipur',
'Meghalaya', 'Mizoram', 'Nagaland', 'Odisha', 'Puducherry',
'Punjab', 'Rajasthan', 'Sikkim', 'Tamil Nadu', 'Telangana ',
'Tripura', 'Uttar Pradesh', 'Uttarakhand', 'West Bengal'],
dtype=object)
for state in all_states:
index = total_amount_of_crops_per_state[total_amount_of_crops_per_state['State_Name'] == state]
fig = px.bar(index, x='Crop', y='Production(tonne)', title=state)
fig.update_layout(barmode='group')
fig.show()
After plotting the graph above, I found that most states have high yield of Rice, Wheat, Maize and Sugarcane
To gather more insight, I created a Visualization showing for each state the total amount produced for crops they produce(below 1million)
for state in all_states:
index = total_amount_of_crops_per_state[total_amount_of_crops_per_state['State_Name'] == state]
test = index[index['Production(tonne)'] <= 999999]
fig = px.bar(test, x='Crop', y='Production(tonne)', title=state)
fig.update_layout(barmode='group')
fig.show()
I decided to see if there would be any relationship betweeen area used and amount produced, because this is meant to be logical begin that the higher the area most times the more you produce
area_prod = df.groupby(['Area(ha)', 'Crop_Type'], as_index = False)['Production(tonne)'].sum()
area_prod
| Area(ha) | Crop_Type | Production(tonne) | |
|---|---|---|---|
| 0 | 0.10 | Kharif | 0.08 |
| 1 | 0.28 | Kharif | 0.02 |
| 2 | 0.50 | Rabi | 0.20 |
| 3 | 0.58 | Rabi | 0.31 |
| 4 | 0.80 | Kharif | 1.00 |
| ... | ... | ... | ... |
| 44166 | 3989200.00 | Kharif | 2899500.00 |
| 44167 | 4205800.00 | Kharif | 3183800.00 |
| 44168 | 5251200.00 | Kharif | 3519600.00 |
| 44169 | 5544000.00 | Kharif | 4485800.00 |
| 44170 | 8580100.00 | Kharif | 7556300.00 |
44171 rows × 3 columns
Renaming production column in the new dataframe produced to Total_Production(tonne)
top_30_area_prod = area_prod.head(30)
top_30_area_prod.rename(columns = {'Production(tonne)':'Total_Production(tonne)'}, inplace = True)
C:\Users\mikos\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\frame.py:5034: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
top_30_area_prod
| Area(ha) | Crop_Type | Total_Production(tonne) | |
|---|---|---|---|
| 0 | 0.10 | Kharif | 0.08 |
| 1 | 0.28 | Kharif | 0.02 |
| 2 | 0.50 | Rabi | 0.20 |
| 3 | 0.58 | Rabi | 0.31 |
| 4 | 0.80 | Kharif | 1.00 |
| 5 | 0.97 | Kharif | 0.12 |
| 6 | 1.00 | Kharif | 6322.96 |
| 7 | 1.00 | Rabi | 1886.07 |
| 8 | 1.00 | Zaid | 1120.00 |
| 9 | 1.05 | Rabi | 0.68 |
| 10 | 1.29 | Kharif | 2.00 |
| 11 | 1.50 | Kharif | 2.26 |
| 12 | 1.58 | Kharif | 2.00 |
| 13 | 1.62 | Rabi | 1.31 |
| 14 | 2.00 | Kharif | 5663.90 |
| 15 | 2.00 | Rabi | 2938.25 |
| 16 | 2.00 | Zaid | 737.00 |
| 17 | 2.08 | Kharif | 1.00 |
| 18 | 2.09 | Kharif | 12.06 |
| 19 | 2.20 | Rabi | 1.00 |
| 20 | 2.50 | Kharif | 0.70 |
| 21 | 2.57 | Kharif | 2.00 |
| 22 | 2.73 | Rabi | 2.00 |
| 23 | 2.78 | Kharif | 2.67 |
| 24 | 2.82 | Rabi | 1.00 |
| 25 | 2.90 | Kharif | 1.90 |
| 26 | 3.00 | Kharif | 10099.74 |
| 27 | 3.00 | Rabi | 3377.91 |
| 28 | 3.00 | Zaid | 1018.00 |
| 29 | 3.25 | Rabi | 2.00 |
fig = px.scatter(top_30_area_prod, x='Area(ha)', y='Total_Production(tonne)', color = 'Crop_Type',
title = 'Graph showing ')
fig.update_layout(barmode = 'group')
fig.show()
After creating a graph showing the total amount produced for different crop types depending on the area used,
I decided to gather more insight from the data by finding for each state how much was produced for different crop types depending on how much land was used by that state
crop_type_produced_per_state = df.groupby(['State_Name', 'Area(ha)', 'Crop_Type'], as_index = False)['Production(tonne)'].sum()
crop_type_produced_per_state
| State_Name | Area(ha) | Crop_Type | Production(tonne) | |
|---|---|---|---|---|
| 0 | Andaman and Nicobar Islands | 1.00 | Kharif | 0.50 |
| 1 | Andaman and Nicobar Islands | 1.50 | Kharif | 2.26 |
| 2 | Andaman and Nicobar Islands | 2.00 | Kharif | 2.00 |
| 3 | Andaman and Nicobar Islands | 2.00 | Rabi | 0.50 |
| 4 | Andaman and Nicobar Islands | 2.09 | Kharif | 12.06 |
| ... | ... | ... | ... | ... |
| 91439 | West Bengal | 3989200.00 | Kharif | 2899500.00 |
| 91440 | West Bengal | 4205800.00 | Kharif | 3183800.00 |
| 91441 | West Bengal | 5251200.00 | Kharif | 3519600.00 |
| 91442 | West Bengal | 5544000.00 | Kharif | 4485800.00 |
| 91443 | West Bengal | 8580100.00 | Kharif | 7556300.00 |
91444 rows × 4 columns
for state in all_states:
index = crop_type_produced_per_state[crop_type_produced_per_state['State_Name'] == state]
fig = px.scatter(index, x='Area(ha)', y='Production(tonne)', color='Crop_Type')
fig.show()
Findings:
From the graph, I found that when the area of land increases for each crop type there is an increase in how much was produced for that crop type
To gather more insight I decided to plot the graph with the amount produced should be less than a 1000
for state in all_states:
index = crop_type_produced_per_state[crop_type_produced_per_state['State_Name'] == state]
test = index[index['Production(tonne)'] <= 999]
fig = px.scatter(test, x='Area(ha)', y='Production(tonne)', color='Crop_Type')
fig.show()
The next step in my EDA is to check if any columns have missing values(i.e null value).
df.isna().sum()
State_Name 0 District_Name 0 Crop_Year 0 Crop 0 Area(ha) 0 Production(tonne) 2552 Weather_Seasons 0 Crop_Type 0 dtype: int64
After checking for null values, I found that only the column Production has values which are null.
Due to only one column having null values, I would like to see if the other values of those columns where production is null,
This enables me know if the null value rows will be important or not.
null_df = df[df['Production(tonne)'].isna()]
null_df
| State_Name | District_Name | Crop_Year | Crop | Area(ha) | Production(tonne) | Weather_Seasons | Crop_Type | |
|---|---|---|---|---|---|---|---|---|
| 396 | Andhra Pradesh | ANANTAPUR | 2007 | Moong(Green Gram) | 1000.0 | NaN | Autumn | Kharif |
| 403 | Andhra Pradesh | ANANTAPUR | 2007 | Horse-gram | 1000.0 | NaN | Winter | Kharif |
| 463 | Andhra Pradesh | ANANTAPUR | 2009 | Rapeseed &Mustard | 8.0 | NaN | Winter | Rabi |
| 480 | Andhra Pradesh | ANANTAPUR | 2010 | Other Kharif pulses | 1.0 | NaN | Autumn | Kharif |
| 875 | Andhra Pradesh | CHITTOOR | 2001 | Wheat | 4.0 | NaN | Winter | Rabi |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 187317 | West Bengal | MEDINIPUR WEST | 2010 | Peas & beans (Pulses) | 2.0 | NaN | Winter | Rabi |
| 187697 | West Bengal | MURSHIDABAD | 2006 | Moong(Green Gram) | 200.0 | NaN | Autumn | Kharif |
| 188554 | West Bengal | PURULIA | 2001 | Rapeseed &Mustard | 427.0 | NaN | Winter | Rabi |
| 188585 | West Bengal | PURULIA | 2002 | Rapeseed &Mustard | 522.0 | NaN | Winter | Rabi |
| 188761 | West Bengal | PURULIA | 2008 | Khesari | 2.0 | NaN | Winter | Rabi |
2552 rows × 8 columns
null_df.head(10)
| State_Name | District_Name | Crop_Year | Crop | Area(ha) | Production(tonne) | Weather_Seasons | Crop_Type | |
|---|---|---|---|---|---|---|---|---|
| 396 | Andhra Pradesh | ANANTAPUR | 2007 | Moong(Green Gram) | 1000.0 | NaN | Autumn | Kharif |
| 403 | Andhra Pradesh | ANANTAPUR | 2007 | Horse-gram | 1000.0 | NaN | Winter | Kharif |
| 463 | Andhra Pradesh | ANANTAPUR | 2009 | Rapeseed &Mustard | 8.0 | NaN | Winter | Rabi |
| 480 | Andhra Pradesh | ANANTAPUR | 2010 | Other Kharif pulses | 1.0 | NaN | Autumn | Kharif |
| 875 | Andhra Pradesh | CHITTOOR | 2001 | Wheat | 4.0 | NaN | Winter | Rabi |
| 983 | Andhra Pradesh | CHITTOOR | 2004 | Wheat | 2.0 | NaN | Winter | Rabi |
| 1056 | Andhra Pradesh | CHITTOOR | 2007 | Moong(Green Gram) | 1000.0 | NaN | Autumn | Kharif |
| 1060 | Andhra Pradesh | CHITTOOR | 2007 | Small millets | 1000.0 | NaN | Autumn | Kharif |
| 1061 | Andhra Pradesh | CHITTOOR | 2007 | Sunflower | 1000.0 | NaN | Autumn | Kharif |
| 1150 | Andhra Pradesh | CHITTOOR | 2010 | Linseed | 18.0 | NaN | Winter | Rabi |
null_df.tail(10)
| State_Name | District_Name | Crop_Year | Crop | Area(ha) | Production(tonne) | Weather_Seasons | Crop_Type | |
|---|---|---|---|---|---|---|---|---|
| 182562 | West Bengal | BARDHAMAN | 2005 | Moong(Green Gram) | 1.0 | NaN | Summer | Kharif |
| 182579 | West Bengal | BARDHAMAN | 2006 | Linseed | 2.0 | NaN | Winter | Rabi |
| 185275 | West Bengal | HOOGHLY | 2008 | Arhar/Tur | 1.0 | NaN | Winter | Kharif |
| 185882 | West Bengal | JALPAIGURI | 2004 | Linseed | 878.0 | NaN | Winter | Rabi |
| 187053 | West Bengal | MEDINIPUR WEST | 2001 | Masoor | 134.0 | NaN | Winter | Rabi |
| 187317 | West Bengal | MEDINIPUR WEST | 2010 | Peas & beans (Pulses) | 2.0 | NaN | Winter | Rabi |
| 187697 | West Bengal | MURSHIDABAD | 2006 | Moong(Green Gram) | 200.0 | NaN | Autumn | Kharif |
| 188554 | West Bengal | PURULIA | 2001 | Rapeseed &Mustard | 427.0 | NaN | Winter | Rabi |
| 188585 | West Bengal | PURULIA | 2002 | Rapeseed &Mustard | 522.0 | NaN | Winter | Rabi |
| 188761 | West Bengal | PURULIA | 2008 | Khesari | 2.0 | NaN | Winter | Rabi |
After displaying the rows with null values on the production column, I found that every other data entries associated with the null look important,
therefore I will have to find out why those columns are empty and if I can do anything regarding the rows aside from removing their data entries
sns.heatmap(df.corr(), annot=True)
<AxesSubplot:>
Due to some of my features like weather_season and crop_type are categorical data entries my heatmap does not show a good amount of features,
to expand my heatmap I would be applying one hot encoding on Weather_Seasons and Crop_Type to see if those features have any relationship with my
outcome Production
Applying pandas dummies to implement one hot encoding on the Weather_Seasons column,
I do this because I want to check if the values in weather_seasons have correlations with values from the other numerical columns
dummies = pd.get_dummies(df.Weather_Seasons)
dummies
| Autumn | Summer | Winter | |
|---|---|---|---|
| 0 | 1 | 0 | 0 |
| 1 | 1 | 0 | 0 |
| 2 | 1 | 0 | 0 |
| 3 | 1 | 0 | 0 |
| 4 | 1 | 0 | 0 |
| ... | ... | ... | ... |
| 188968 | 0 | 1 | 0 |
| 188969 | 0 | 1 | 0 |
| 188970 | 0 | 1 | 0 |
| 188971 | 0 | 0 | 1 |
| 188972 | 0 | 0 | 1 |
188973 rows × 3 columns
dummies2 = pd.get_dummies(df.Crop_Type)
dummies2
| Kharif | Rabi | Zaid | |
|---|---|---|---|
| 0 | 0 | 1 | 0 |
| 1 | 1 | 0 | 0 |
| 2 | 1 | 0 | 0 |
| 3 | 0 | 1 | 0 |
| 4 | 1 | 0 | 0 |
| ... | ... | ... | ... |
| 188968 | 1 | 0 | 0 |
| 188969 | 1 | 0 | 0 |
| 188970 | 1 | 0 | 0 |
| 188971 | 1 | 0 | 0 |
| 188972 | 1 | 0 | 0 |
188973 rows × 3 columns
Merging the One hot encoded values to the main dataframe
df1_merged = pd.concat([df, dummies], axis = 'columns')
df1_merged
| State_Name | District_Name | Crop_Year | Crop | Area(ha) | Production(tonne) | Weather_Seasons | Crop_Type | Autumn | Summer | Winter | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Andaman and Nicobar Islands | NICOBARS | 2000 | Arecanut | 1254.0 | 2000.0 | Autumn | Rabi | 1 | 0 | 0 |
| 1 | Andaman and Nicobar Islands | NICOBARS | 2000 | Other Kharif pulses | 2.0 | 1.0 | Autumn | Kharif | 1 | 0 | 0 |
| 2 | Andaman and Nicobar Islands | NICOBARS | 2000 | Rice | 102.0 | 321.0 | Autumn | Kharif | 1 | 0 | 0 |
| 3 | Andaman and Nicobar Islands | NICOBARS | 2001 | Arecanut | 1254.0 | 2061.0 | Autumn | Rabi | 1 | 0 | 0 |
| 4 | Andaman and Nicobar Islands | NICOBARS | 2001 | Other Kharif pulses | 2.0 | 1.0 | Autumn | Kharif | 1 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 188968 | West Bengal | PURULIA | 2014 | Maize | 325.0 | 2039.0 | Summer | Kharif | 0 | 1 | 0 |
| 188969 | West Bengal | PURULIA | 2014 | Rice | 306.0 | 801.0 | Summer | Kharif | 0 | 1 | 0 |
| 188970 | West Bengal | PURULIA | 2014 | Sesamum | 627.0 | 463.0 | Summer | Kharif | 0 | 1 | 0 |
| 188971 | West Bengal | PURULIA | 2014 | Rice | 279151.0 | 597899.0 | Winter | Kharif | 0 | 0 | 1 |
| 188972 | West Bengal | PURULIA | 2014 | Sesamum | 175.0 | 88.0 | Winter | Kharif | 0 | 0 | 1 |
188973 rows × 11 columns
df_merged = pd.concat([df1_merged, dummies2], axis = 'columns')
df_merged
| State_Name | District_Name | Crop_Year | Crop | Area(ha) | Production(tonne) | Weather_Seasons | Crop_Type | Autumn | Summer | Winter | Kharif | Rabi | Zaid | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Andaman and Nicobar Islands | NICOBARS | 2000 | Arecanut | 1254.0 | 2000.0 | Autumn | Rabi | 1 | 0 | 0 | 0 | 1 | 0 |
| 1 | Andaman and Nicobar Islands | NICOBARS | 2000 | Other Kharif pulses | 2.0 | 1.0 | Autumn | Kharif | 1 | 0 | 0 | 1 | 0 | 0 |
| 2 | Andaman and Nicobar Islands | NICOBARS | 2000 | Rice | 102.0 | 321.0 | Autumn | Kharif | 1 | 0 | 0 | 1 | 0 | 0 |
| 3 | Andaman and Nicobar Islands | NICOBARS | 2001 | Arecanut | 1254.0 | 2061.0 | Autumn | Rabi | 1 | 0 | 0 | 0 | 1 | 0 |
| 4 | Andaman and Nicobar Islands | NICOBARS | 2001 | Other Kharif pulses | 2.0 | 1.0 | Autumn | Kharif | 1 | 0 | 0 | 1 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 188968 | West Bengal | PURULIA | 2014 | Maize | 325.0 | 2039.0 | Summer | Kharif | 0 | 1 | 0 | 1 | 0 | 0 |
| 188969 | West Bengal | PURULIA | 2014 | Rice | 306.0 | 801.0 | Summer | Kharif | 0 | 1 | 0 | 1 | 0 | 0 |
| 188970 | West Bengal | PURULIA | 2014 | Sesamum | 627.0 | 463.0 | Summer | Kharif | 0 | 1 | 0 | 1 | 0 | 0 |
| 188971 | West Bengal | PURULIA | 2014 | Rice | 279151.0 | 597899.0 | Winter | Kharif | 0 | 0 | 1 | 1 | 0 | 0 |
| 188972 | West Bengal | PURULIA | 2014 | Sesamum | 175.0 | 88.0 | Winter | Kharif | 0 | 0 | 1 | 1 | 0 | 0 |
188973 rows × 14 columns
Making a heatmap with the one hot encoded dataframe features
sns.heatmap(df_merged.corr(), cmap='BrBG', annot=True)
<AxesSubplot:>
After creaitng a heatmap to see if any features have correlations with my target feature production(tonne),
I noticed area(ha) has a good correlation with production(tonne)
cdf = df_merged.copy()
sns.pairplot(cdf.corr(), height = 1.5)
<seaborn.axisgrid.PairGrid at 0x2a8383d2fa0>
df_merged.to_csv('crop_production_final.csv', index = True)